Part-of-Speech Tagging with Minimal Lexicalization
Authors: Virginia Savova & Leonid Peshkin
Abstract
We use a Dynamic Bayesian Network (dbn) to represent compactly a variety of sublexical and contextual features relevant to Part-of-Speech (PoS) tagging. The outcome is a flexible tagger (LegoTag) with state-of-the-art performance (3.6% error on a benchmark corpus). We explore the effect of eliminating redundancy and radically reducing the size of feature vocabularies. We find that a small but linguistically motivated set of suffixes results in improved cross-corpora generalization. We also show that a minimal lexicon limited to function words is sufficient to ensure reasonable performance.

1 Part-of-Speech Tagging

Many NLP applications are faced with the dilemma of whether to use statistically extracted or expert-selected features. There are good arguments in support of either view. Statistical feature selection does not require extensive use of human domain knowledge, while feature sets chosen by experts are more economical and generalize better to novel data.

Most currently available PoS taggers perform with a high degree of accuracy. However, it appears that this success can be overwhelmingly attributed to an across-the-board lexicalization of the task. Indeed, Charniak, Hendrickson, Jacobson & Perkowitz (1993) note that a simple strategy of picking the most likely tag for each word in a text leads to 90% accuracy. If so, it is not surprising that taggers using vocabulary lists, with the number of entries ranging from 20k to 45k, perform well. Even though a unigram model achieves an overall accuracy of 90%, it relies heavily on lexical information and is next to useless on nonstandard texts that contain a lot of domain-specific terminology.

The lexicalization of the PoS tagging task comes at a price. Since word lists are assembled from the training corpus, they hamper generalization across corpora. In our experience, taggers trained on the Wall Street Journal (wsj) perform poorly on novel text such as email or newsgroup messages (a.k.a. Netlingo). At the same time, alternative training data are scarce and expensive to create. This paper explores an alternative to lexicalization. Using linguistic knowledge, we construct a minimalist tagger with a small but efficient feature set, which maintains reasonable performance across corpora.

A look at the previous work on this task reveals that the unigram model is at the core of even the most sophisticated taggers. The best-known rule-based tagger (Brill 1994) works in two stages: it assigns the most likely tag to each word in the text; then, it applies transformation rules of the form “Replace tag X by tag Y in triggering environment Z”. The triggering environments span up to three sequential tokens in each direction and refer to words, tags or properties of words within the region. The Brill tagger achieves less than 3.5% error on the wsj corpus. However, its performance depends on a comprehensive vocabulary (70k words).

Statistical tagging is a classic application of Markov Models (mms). Brants (2000) argues that second-order mms can also achieve state-of-the-art accuracy, provided they are supplemented by smoothing techniques and mechanisms to handle unknown words. TnT handles unknown words by estimating the tag probability given the suffix of the unknown word and its capitalization. The reported 3.3% error for the Trigrams 'n' Tags (TnT) tagger on the wsj (trained on 10^6 words and tested on 10^4) appears to be a result of overfitting.
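The suffix-and-capitalization strategy for unknown words is easy to picture with a small sketch. The Python below collects tag counts for word-final character sequences together with a capitalization flag and backs off from longer to shorter suffixes when guessing; the function names, the simple back-off, and the example data are illustrative assumptions rather than TnT's actual smoothed implementation.

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_tokens, max_suffix_len=4):
    """tagged_tokens: iterable of (word, tag) pairs from a training corpus.

    Builds counts of tags seen with each (suffix, capitalized?) pair.
    Illustrative sketch only, not TnT's actual smoothing.
    """
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        is_cap = word[0].isupper()
        for k in range(1, min(max_suffix_len, len(word)) + 1):
            counts[(word[-k:], is_cap)][tag] += 1
    return counts

def guess_tag(word, counts, max_suffix_len=4, default="NN"):
    """Back off from the longest matching suffix to shorter ones."""
    is_cap = word[0].isupper()
    for k in range(min(max_suffix_len, len(word)), 0, -1):
        key = (word[-k:], is_cap)
        if key in counts:
            return counts[key].most_common(1)[0][0]
    return default

# Example: an unseen "-ly" word is guessed to be an adverb.
model = train_suffix_model([("quickly", "RB"), ("happily", "RB"), ("cat", "NN")])
print(guess_tag("slowly", model))   # -> 'RB'
```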
Indeed, this is the maximum performance obtained by training TnT until only 2.9% of words are unknown in the test corpus. A simple examination of the wsj shows that such a percentage of unknown words in the testing section (10% of the wsj corpus) requires building an unreasonably large lexicon of nearly all (about 44k) words seen in the training section (90% of the wsj), thus ignoring the danger of overfitting. Hidden mms (hmms) are trained on a dictionary with information about the possible PoS of words (Jelinek 1985; Kupiec 1992). This means hmm taggers also rely heavily on lexical information.

Obviously, PoS tags depend on a variety of sublexical features, as well as on the likelihood of tag/tag and tag/word sequences. In general, all existing taggers have incorporated such information to some degree. The Conditional Random Fields (crf) model (Lafferty, McCallum & Pereira 2002) outperforms the hmm tagger on unknown words by relying extensively on orthographic and morphological features. It checks whether the first character of a word is capitalized or numeric; it also registers the presence of a hyphen and of morphologically relevant suffixes (-ed, -ly, -s, -ion, -tion, -ity, -ies). The authors note that crf-based taggers are potentially flexible because they can be combined with feature induction algorithms. However, training is complex (AdaBoost + Forward-Backward) and slow (10 iterations with an optimized initial parameter vector; it fails to converge with unbiased initial conditions). It is unclear what the relative contribution of the features is in this model.

The Maximum Entropy tagger (MaxEnt, see Ratnaparkhi 1996) accounts for the joint distribution of PoS tags and features of a sentence with an exponential model. Its features are along the lines of the crf model:

1. Capitalization: does the token contain a capital letter?
2. Hyphenation: does the token contain a hyphen?
3. Numeric: does the token contain a number?
4. Prefix: frequent prefixes, up to 4 letters long;
5. Suffix: frequent suffixes, up to 4 letters long.

In addition, Ratnaparkhi uses lexical information on frequent words in the context of five words. The sizes of the current word, prefix, and suffix lists were 6458, 3602 and 2925, respectively. These are supplemented by special Previous Word vocabularies. Features frequently observed in a training corpus are selected from a candidate feature pool. The parameters of the model are estimated using the computationally intensive procedure of Generalized Iterative Scaling (gis) to maximize the conditional probability of the training set given the model. The MaxEnt tagger has a 3.4% error rate.

Our investigation examines how much lexical information can be recovered from sublexical features. In order to address these issues, we reuse the feature set of MaxEnt in a new model, which we subsequently minimize with the help of linguistically motivated vocabularies.

2 PoS Tagging Bayesian Net

Our tagger combines the features suggested in the literature to date into a Dynamic Bayesian Network (dbn). We briefly introduce the essential aspects of dbns here and refer the reader to a recent dissertation (Murphy 2002) for an excellent survey. A dbn is a Bayesian network unwrapped in time, such that it can represent dependencies between variables at adjacent time slices. More formally, a dbn consists of two models, an initial model and a transition model; the initial model defines the distribution over the variables at time 0 by specifying:

• a set of variables X1, . . . , Xn;
• a directed acyclic graph over the variables;
• for each variable Xi, a table specifying the conditional probability of Xi given its parents in the graph, Pr(Xi | Par{Xi}).

The joint probability distribution over the initial state is:

Pr(X1, . . . , Xn) = ∏_i Pr(Xi | Par{Xi})
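As a minimal illustration of this factorization, the sketch below computes the joint probability of a full assignment from per-variable conditional probability tables. The toy network (a capitalization flag with a dependent tag variable) and its numbers are invented for illustration and do not reflect the tagger's actual structure.

```python
# Toy Bayesian network slice: Cap has no parents, Tag depends on Cap.
parents = {"Cap": [], "Tag": ["Cap"]}

# cpt[var] maps a tuple of parent values -> {value: probability}
cpt = {
    "Cap": {(): {True: 0.1, False: 0.9}},
    "Tag": {(True,): {"NNP": 0.7, "NN": 0.3},
            (False,): {"NNP": 0.05, "NN": 0.95}},
}

def joint_probability(assignment):
    """Pr(X1, ..., Xn) = product over i of Pr(Xi | Par{Xi}) for a full assignment."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[par] for par in parents[var])
        p *= cpt[var][parent_values][value]
    return p

print(joint_probability({"Cap": True, "Tag": "NNP"}))   # 0.1 * 0.7 = 0.07
```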